Marco Huang (0201), Jingyun Li (0101)
COVID-19 is the disease caused by SARS-CoV-2, the coronavirus that emerged in December 2019. COVID-19 can be severe, and has caused millions of deaths around the world as well as lasting health problems in some who have survived the illness. The coronavirus can be spread from person to person. It is diagnosed with a test.
Three years after the outbreak of the coronavirus, the growth in confirmed COVID-19 cases appears to be slowing, making this a good time to examine the pandemic as a whole. In part one, we look at COVID-19 in the United States, and at several representative states in particular, and discuss what the data illustrates. In part two, we examine the relationship between confirmed cases and housing prices.
The COVID-19 data used in this project comes from Johns Hopkins University and is available at this link: https://github.com/CSSEGISandData/COVID-19
We used the following tools to analyze this data: pandas, numpy, matplotlib, plotly, scipy, statsmodels, scikit-learn, and more.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
import os
import scipy.stats as stats
from statsmodels.formula.api import ols as o
from sklearn import linear_model
import re
warnings.filterwarnings('ignore')
1.2.1 US overall
We first want to look at the overall confirmed and death cases in the US. Here we read in the worldwide confirmed-case data from 1/22/20 through today. For this project, we focus on the United States.
Below is the global confirmation data. It includes every country, its latitude and longitude, and its daily cumulative confirmed case counts.
world_conf = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv", sep=',')
world_conf.head()
We first extract the US confirmation data from the world data frame. We unpivot the per-date columns into rows, then compute the day-over-day increase in confirmed cases.
# Unpivot the per-date columns into (Date, conf_cases) rows
us_conf = pd.melt(world_conf, ['Province/State', 'Country/Region', 'Lat', 'Long'], var_name="Date", value_name='conf_cases')
us_conf = us_conf.drop(columns=['Province/State', 'Lat', 'Long'])
us_conf = us_conf.rename(columns={'Country/Region': 'Country'})
us_conf["Date"] = pd.to_datetime(us_conf['Date'])
# Sum province-level rows so each country has one row per date
us_conf = us_conf.groupby(['Country', 'Date']).sum()
# Daily increase = today's cumulative total minus yesterday's
us_conf["Prev_day"] = us_conf['conf_cases'].shift(fill_value=0)
us_conf["conf_change"] = us_conf['conf_cases'] - us_conf['Prev_day']
us_conf = us_conf.drop(columns=['Prev_day'])
us_conf = us_conf.reset_index()
# Drop negative changes (reporting corrections) and keep only the US
us_conf = us_conf[us_conf["conf_change"] >= 0]
us_conf = us_conf[us_conf["Country"] == "US"]
us_conf = us_conf.set_index("Date")
us_conf = us_conf.drop(columns=['Country'])
us_conf.head()
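The cumulative-to-daily computation above can be sketched on toy data (the numbers below are hypothetical, not real case counts). `diff()` is equivalent to the shift-and-subtract used in the pipeline:

```python
import pandas as pd

# Toy cumulative series (hypothetical numbers)
toy = pd.DataFrame({"conf_cases": [0, 3, 7, 7, 12]},
                   index=pd.date_range("2020-01-22", periods=5))

# Subtract the previous day's cumulative total to recover the daily
# increase; fillna(0) mirrors shift(fill_value=0) on the first day.
toy["conf_change"] = toy["conf_cases"].diff().fillna(0).astype(int)

print(toy["conf_change"].tolist())  # [0, 3, 4, 0, 5]
```

A flat day in the cumulative series (7 → 7) correctly yields a daily change of 0.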
Below is the global death data. It includes every country, its latitude and longitude, and its daily cumulative death counts.
world_death = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv", sep=',')
world_death.head()
We did the same for the death data: we unpivoted the per-date columns and computed the day-over-day increase in deaths.
# Unpivot the per-date columns into (Date, death_cases) rows
us_death = pd.melt(world_death, ['Province/State', 'Country/Region', 'Lat', 'Long'], var_name="Date", value_name='death_cases')
us_death = us_death.drop(columns=['Province/State', 'Lat', 'Long'])
us_death = us_death.rename(columns={'Country/Region': 'Country'})
us_death["Date"] = pd.to_datetime(us_death['Date'])
# Sum province-level rows so each country has one row per date
us_death = us_death.groupby(['Country', 'Date']).sum()
# Daily increase = today's cumulative total minus yesterday's
us_death["Prev_day"] = us_death['death_cases'].shift(fill_value=0)
us_death["death_change"] = us_death['death_cases'] - us_death['Prev_day']
us_death = us_death.drop(columns=['Prev_day'])
us_death = us_death.reset_index()
# Drop negative changes (reporting corrections) and keep only the US
us_death = us_death[us_death["death_change"] >= 0]
us_death = us_death[us_death["Country"] == "US"]
us_death = us_death.set_index("Date")
us_death = us_death.drop(columns=['Country'])
us_death.head()
We then joined the tables into a data frame us_overall. The new data frame contains cumulative confirmed cases, daily confirmed change, cumulative deaths, and daily death change all in one place.
us_overall = us_conf.join(us_death, how='outer')
us_overall.head()
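The behavior of `join(how='outer')` here can be sketched with toy frames standing in for us_conf and us_death (hypothetical numbers): the two frames are aligned on their shared DatetimeIndex, and any date present in only one frame is kept with NaN in the other's columns.

```python
import pandas as pd

dates = pd.date_range("2020-01-22", periods=3)
# Hypothetical stand-ins for us_conf and us_death
conf = pd.DataFrame({"conf_cases": [1, 4, 9]}, index=dates)
death = pd.DataFrame({"death_cases": [0, 1, 2]}, index=dates)

# Outer join on the DatetimeIndex: keeps every date from either frame
overall = conf.join(death, how="outer")

print(overall.columns.tolist())  # ['conf_cases', 'death_cases']
print(overall.shape)             # (3, 2)
```

Since both real frames cover the same date range, the outer join here behaves the same as an inner join; `how='outer'` just guards against either series missing a date.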
1.2.2 US states
Below are the confirmed cases in each state of the United States. We also want to single out states that are representative of particular areas. Below are the states we picked for this project; we selected one state from each of nine regions.
We first read in the data from the Hopkins site.
conf = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv", sep=',')
conf.head()
We aggregate the confirmation data for all counties in the nine selected states into nine entries, each holding that state's cumulative number of confirmations per day. Finally, for better presentation, we transposed the data frame so dates form the index.
MD = conf[conf["Province_State"] == "Maryland"]
frames = [MD]
# Seed the frame with Maryland: drop the metadata columns, append the
# state-wide total as a row, then drop the individual county rows
confirmed = MD.drop(conf.columns[0:11], axis=1)
confirmed = confirmed.append(confirmed.sum(numeric_only=True), ignore_index=True)
confirmed.drop(confirmed.index[0:26], inplace=True)
list1 = ["Maine", "New York", "Wisconsin", "Kansas", "Alabama", "Texas", "Arizona", "California"]
for x in list1:
    state = conf[conf["Province_State"] == x]
    frames.append(state)
    time = state.drop(state.columns[0:11], axis=1)
    # Append the column-wise sum over the state's counties as one row
    # (summing the counties directly avoids double-counting the total)
    confirmed = confirmed.append(time.sum(numeric_only=True), ignore_index=True)
confirmed = confirmed.rename(index={0: 'Maryland', 1: 'Maine', 2: 'New York', 3: 'Wisconsin', 4: 'Kansas', 5: 'Alabama', 6: 'Texas', 7: 'Arizona', 8: 'California'})
result = pd.concat(frames)
# Transpose so dates index the rows and states are the columns
confirmed = confirmed.swapaxes("index", "columns")
confirmed.index = pd.to_datetime(confirmed.index)
confirmed.head()
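An equivalent and arguably simpler way to get per-state cumulative totals from the county-level file is a single groupby. A minimal sketch on toy data shaped like the Hopkins file (column names taken from the snippet above, values hypothetical):

```python
import pandas as pd

# Toy county-level frame mimicking the Hopkins layout: metadata columns
# followed by one cumulative-count column per date (hypothetical values)
toy = pd.DataFrame({
    "Admin2": ["Baltimore", "Montgomery", "Travis"],
    "Province_State": ["Maryland", "Maryland", "Texas"],
    "1/22/20": [1, 2, 5],
    "1/23/20": [3, 4, 8],
})

# One groupby replaces the per-state append loop: sum county rows
# within each state, then transpose so dates become the index
by_state = toy.groupby("Province_State").sum(numeric_only=True).T
by_state.index = pd.to_datetime(by_state.index)

print(by_state["Maryland"].tolist())  # [3, 7]
```

Selecting the nine states afterward is then just `by_state[state_list]`, and there is no need to track row positions or rename an integer index.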
We then read in the number of deaths by state in the US.
us_dead = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv", sep=',')
us_dead.head()
Then we did the same with the deaths data for the nine states: we aggregated the death data for all counties in each selected state into nine entries, each holding that state's cumulative daily death counts, and we transposed the data frame.
MD2 = us_dead[us_dead["Province_State"] == "Maryland"]
# Same pattern as for confirmations; the deaths file has one extra
# metadata column (Population), hence dropping 12 columns instead of 11
death = MD2.drop(MD2.columns[0:12], axis=1)
death = death.append(death.sum(numeric_only=True), ignore_index=True)
death.drop(death.index[0:26], inplace=True)
for x in list1:
    state2 = us_dead[us_dead["Province_State"] == x]
    time2 = state2.drop(state2.columns[0:12], axis=1)
    # Append the column-wise sum over the state's counties as one row
    # (summing the counties directly avoids double-counting the total)
    death = death.append(time2.sum(numeric_only=True), ignore_index=True)
death = death.rename(index={0: 'Maryland', 1: 'Maine', 2: 'New York', 3: 'Wisconsin', 4: 'Kansas', 5: 'Alabama', 6: 'Texas', 7: 'Arizona', 8: 'California'})
death = death.swapaxes("index", "columns")
death.index = pd.to_datetime(death.index)
death.head()
us_overall.plot(y="conf_cases", legend=None, figsize=(15,10))
From the plot above, it is reasonable to conclude that the growth rate flattens out after the sudden surge around June 2022. One possible reason for the surge is that June and July are popular vacation months, which may increase the chance of contact. Another possible reason is a change in mask-wearing policy: as mandates relaxed, fewer people wore masks than before, which may also have contributed to the increase in infections.
We then used the daily increase in confirmed cases to make a daily-increase plot. This plot illustrates the rate of change better than the one above.
us_overall.plot(y="conf_change", legend=None, figsize=(15,10))
What we discovered in the first plot is also illustrated here. Indeed, this plot captures more of the change than the first one, because it is effectively the derivative of the cumulative curve. As we can see, the surge in confirmations is more drastic in this plot.
2.1.2 Deaths trend in the US
We then used the data in us_overall to plot the cumulative death count.
us_overall.plot(y="death_cases", legend=None, figsize=(15,10))
There has been steady growth in cumulative deaths, just as in cumulative confirmations. However, one noticeable fact in this plot is that around the time of the sudden surge in confirmed cases, the total death count did not change much. We suspect this is because COVID-19 symptoms had become milder than they were at the very beginning of the pandemic.
Below is the daily-increase plot made from our us_overall data frame.
us_overall.plot(y="death_change", figsize=(15,10), legend=None)
This plot also supports the point made above. Daily deaths from July 2021 to July 2022 fluctuated between roughly 1,000 and 4,000 per day, but there is no pattern showing a sudden surge around June 2022.
Now let's look at the nine specific states discussed in the first part.
confirmed.plot(figsize=(15,10))
Cases in all nine states continue to trend upward, and the number of confirmed cases increased significantly in June 2022; none of the states shows a tendency opposite to the overall national trend. It is also worth noticing that coastal states have higher cumulative confirmation counts. This is mainly because coastal areas have larger populations, and they are likely to contain a higher proportion of transient, non-resident population.
death.plot(figsize=(15,10))
We then did the same with the nine selected states as with the global confirmed and death data: we calculated the daily increase in confirmed cases for each county in the nine selected states.
# Keep county name, state, coordinates, and the per-date columns
result = result.drop(result.columns[[0, 1, 2, 3, 4, 7, 10]], axis=1)
result = pd.melt(result, ['Admin2', 'Province_State', 'Lat', 'Long_'], var_name="Date", value_name='Cases')
result = result.drop(columns=['Province_State'])
result = result.rename(columns={'Admin2': 'Admin', 'Long_': 'Long'})
result["Date"] = pd.to_datetime(result['Date'])
result = result.groupby(['Admin', 'Date']).sum()
# Daily increase = today's cumulative total minus yesterday's
result["Prev_day"] = result['Cases'].shift(fill_value=0)
result["Daily_change"] = result['Cases'] - result['Prev_day']
result = result.drop(columns=['Prev_day'])
result = result.reset_index()
# Drop negative changes caused by reporting corrections
result = result[result["Daily_change"] >= 0]
To illustrate the change better, we plotted the daily increase on a US map as an animated bubble map, with bubble size proportional to each county's daily change. Below is the code we used to generate the map.
result["Date"] = result["Date"].astype(str)
fig = px.scatter_geo(result, lat="Lat", lon="Long",
                     hover_name="Admin", size="Daily_change", size_max=80,
                     animation_frame="Date",
                     scope="usa",
                     title="Total Cases")
# Speed up the animation: 50 ms per frame
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 50
fig.show()